nhslogo CS4132 Data Analytics

The Rise (and Fall) of Wordle by Lam Eu Ler¶

Table of Contents¶

  • Motivation & Background
  • Summary of Research Questions
  • Methodology
    • Data Acquisition
    • Data Cleaning
    • EDA
  • Recommendations or Further Works

Motivation and Background¶

Wordle was created in mid-2021 (more specifically, the game refers to 20 June 2021 as the first Wordle) by software engineer John Wardle, who created the game for his girlfriend as he knew she liked word games. The couple played for months, and after it became a hit in family group chats, the game was released to the public in October 2021. The game has since become a worldwide sensation, with millions of people playing it worldwide.

The free-to-play online word guessing game involves players having to guess a 5 letter word in 6 tries. It has a single daily solution, and all players attempt to guess the same word. The game was purchased from creater Josh Wardle by The New York Times Company in January 2022.

As someone who plays Wordle on a daily basis, I was intrigued to explore Wordle in a more analytical fashion. In this project, I will look into Wordle's growth (and decline) and popularity, as well as analyze players' posted results.

Summary of Research Questions¶

  1. How has Wordle's popularity changed since it was first released to the public?
  2. Where is wordle most popular?
  3. What is the average Wordle score?
  4. What makes some Wordles more difficult than others?
  5. Which is the most common starting word and how much does it affect the score?

Methodology¶

pip install pyarrow
pip install plotly
pip install statsmodel

Data Acquisition¶

In [1]:
import numpy as np
import pandas as pd
import pyarrow
from pyarrow import csv as arcsv, Table
In [2]:
def read_csv(filename, dtype=None):
    return arcsv.read_csv(filename, 
        parse_options = arcsv.ParseOptions(newlines_in_values=True),
        convert_options = arcsv.ConvertOptions(column_types=dtype)
    ).to_pandas()
    
def write_csv(data, filename):
    arcsv.write_csv(Table.from_pandas(data), filename)

Tweets¶

Refer to appendix section A.1

In [3]:
read_csv('data/twitter/raw/wordles420.csv').head(20)
Out[3]:
id date content username userLocation sourceLabel
0 1559233311226138628 2022-08-15 17:39:25+00:00 Wordle 420 6/6\n\n⬛🟩⬛⬛⬛\n⬛🟩⬛⬛⬛\n⬛🟩🟩🟩🟩\n⬛🟩🟩🟩🟩\n... IllanMejido Tokyo-3 Twitter Web App
1 1559232918035525632 2022-08-15 17:37:51+00:00 Wordle 420 3/6*\n\nLine 1: 3rd perfect.\nLine ... klchubbuck Seattle, WA Twitter Web App
2 1559215470867189765 2022-08-15 16:28:31+00:00 Wordle 422 6/6\n\n🟨⬜⬜⬜⬜\n🟩🟩⬜⬜⬜\n🟩🟩⬜🟩🟩\n🟩🟩⬜🟩🟩\n... nottodaybrother Grinnell, IA Twitter Web App
3 1559203001620787200 2022-08-15 15:38:58+00:00 Wordle 420 5/6\n\n⬜⬜⬜⬜🟨\n⬜⬜🟩⬜⬜\n⬜⬜🟩🟩🟩\n⬜🟩🟩🟩🟩\n... jocal3 Sammamish Twitter Web App
4 1559190764910952450 2022-08-15 14:50:21+00:00 Wordle 420 5/6\n\n⬜⬜⬜🟨⬜\n🟨⬜⬜🟨⬜\n⬜⬜🟩⬜⬜\n⬜🟩🟩⬜🟨\n... ThePublicHorse Toronto, Ontario TweetDeck
5 1559175718520688643 2022-08-15 13:50:34+00:00 Wordle 420 4/6\n\n⬛⬛⬛🟨⬛\n⬛⬛🟩⬛🟨\n🟩⬛🟩⬛⬛\n🟩🟩🟩🟩🟩 Pyrotastic Miami, FL Twitter Web App
6 1559149586358362113 2022-08-15 12:06:43+00:00 @afterexposure Yeah, your Wordle 420 X/6 was s... HoosAGoodBoy Montréal, Québec Twitter Web App
7 1559149366446792704 2022-08-15 12:05:51+00:00 Saturday, 8/13’s\n\nWordle 420 5/6\n\n⬛⬛⬛⬛🟩\n⬛... ZWordles Twitter for iPhone
8 1559146814535770113 2022-08-15 11:55:42+00:00 Wordle 420 4/6\n\n⬛⬛⬛⬛⬛\n⬛⬛⬛⬛⬛\n⬛🟩⬛⬛⬛\n🟩🟩🟩🟩🟩 DaithiKelleher Dublin, Ireland Twitter for Android
9 1559140700649918464 2022-08-15 11:31:25+00:00 Wordle 420 5/6\n\n⬛🟩⬛⬛⬛\n⬛🟩⬛⬛⬛\n⬛🟩🟩⬛🟩\n⬛🟩🟩🟩🟩\n... wbourgou The cold part of Canada Twitter for Android
10 1559115522356441088 2022-08-15 09:51:22+00:00 Wordle 420 5/6\n\n⬛⬛⬛⬛🟨\n⬛⬛⬛🟨⬛ \n⬛🟨🟨⬛🟨\n⬛⬛🟨🟨🟨\... sykedelic_3406 India Twitter Web App
11 1559079102048452608 2022-08-15 07:26:38+00:00 Wordle 420 5/6*\n\n⬛⬛🟨⬛⬛\n⬛🟨⬛🟩⬛\n🟩🟩⬛🟩🟩\n🟩🟩⬛🟩🟩\... manishgant San Francisco Bay Area Twitter for iPhone
12 1559063471551037441 2022-08-15 06:24:32+00:00 Wordle 420 4/6*\n\n⬛⬛⬛🟨⬛\n⬛🟨🟨⬛⬛\n🟩🟩⬛🟩🟩\n🟩🟩🟩🟩🟩 Natalie54056779 Wednesfield Twitter for Android
13 1559053049406144512 2022-08-15 05:43:07+00:00 @cougsgo @JasonPuckett20 @_MargaretLarson Word... sddebruler ÜT: 48.453574,-122.324751 Twitter for Android
14 1559023698232987649 2022-08-15 03:46:29+00:00 I haven't posted a #Wordle in a little while, ... AlexiSargeant Manhattan, NY Twitter Web App
15 1559018301325512704 2022-08-15 03:25:02+00:00 Current Streak:20\n\nWordle 420 6/6\n\n⬜⬜⬜⬜🟨\n... SideMoss5126524 Twitter for iPad
16 1559013629319684097 2022-08-15 03:06:28+00:00 Wordle 420 4/6*\n\n⬛⬛⬛⬛⬛\n⬛🟨⬛⬛⬛\n⬛🟨⬛⬛🟨\n🟩🟩🟩🟩🟩 AlexGodofsky Fairfax, VA Twitter Web App
17 1559011999937433600 2022-08-15 03:00:00+00:00 Wordle 420 X/6\n\n⬛⬛⬛⬛🟩\n⬛🟨⬛🟨⬛\n🟩🟩⬛⬛🟩\n🟩🟩⬛⬛🟩\n... s_kwkm http://uraraka.work/ Twitter Web App
18 1558998211918012416 2022-08-15 02:05:13+00:00 Well, crap. A lesson not to do wordle while di... SarahAnnMasse LA & NYC Twitter Web App
19 1558980696936308736 2022-08-15 00:55:37+00:00 Catch-up! There were 2 Saturday sedecordle sol... Xanthe_Cat Australia. Naarm, VIC 3018 Twitter Web App

Reddit comments¶

Refer to appendix section A.4

In [4]:
read_csv('data/reddit/comments.csv').head(20)
Out[4]:
author body score created_utc
0 lavinient Scoredle 439 4/6* \n\n14,855 \n⬛⬛🟨⬛⬛ >!SLIME... 26 1662010774
1 BlitzAceSamy Scoredle 439 3/6* \n\n|Guess|Result|Scoredle|... 11 1661943874
2 eliw23 Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜🟨⬜ >!STAIR... 9 1662007274
3 blackbeanqueen Scoredle 439 3/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!SLATE... 8 1662065267
4 MidnightExcursion Did you hear about the >!mushroom!< that got k... 7 1662019442
5 BlackJPI Scoredle 439 4/6* \n\n14,855 \n⬜⬜⬜⬜⬜ >!STEAM... 8 1662029435
6 cheezislife Scoredle 439 4/6* \n\n14,855 \n🟩⬛⬛⬛⬛ >!FLASH... 6 1662067855
7 HS007 Scoredle 439 6/6 \n\n14,855 \n* ⬛🟨⬛🟨⬛ >!PIOU... 5 1661972028
8 Pinnacle55 Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛🟨⬛ >!STONE... 5 1662020156
9 Inna_Bien >!words that end with i are a special kind of ... 5 1662022149
10 shombled Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛⬛⬛ >!STARE... 4 1662006454
11 LadeeAlana People say that I'm kind of a >!FUN GUY!< myself. 3 1662021326
12 TaurAnder Scoredle 439 3/6* \n\n14,855 \n* 🟩⬛🟨⬛⬛ >!FLI... 4 1662022421
13 cyrano4833 Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STORE... 3 1662028951
14 oglabradoodle Scoredle 439 4/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STEAM... 5 1662040052
15 a-little-bit-this Scoredle 439 4/6* \n\n14,855 \n* ⬜⬜⬜🟨⬜ >!ALO... 3 1662054311
16 mohamedzu Scoredle 439 3/6* \n\n14,855 \n🟩🟨🟨⬛⬛ >!FIGHT... 4 1662055968
17 KingD123 Scoredle 439 3/6* \n\n12,974 \n⬛⬛⬛🟨⬛ >!PLANK... 4 1662068789
18 soleilady Scoredle 439 4/6* \n\n12,974 \n⬜⬜🟨⬜⬜ >!ARISE... 3 1662072125
19 calscigal Scoredle 439 3/6* \n\n12,974 \n* ⬛⬛⬛⬛⬛ >!CHA... 4 1662075325

Interest over time - Google Trends¶

In [5]:
pd.read_csv('data/trends/trends.csv').head(20)
Out[5]:
date Wordle
0 2021-12-26 0
1 2021-12-27 0
2 2021-12-28 0
3 2021-12-29 0
4 2021-12-30 0
5 2021-12-31 1
6 2022-01-01 1
7 2022-01-02 1
8 2022-01-03 1
9 2022-01-04 4
10 2022-01-05 3
11 2022-01-06 4
12 2022-01-07 6
13 2022-01-08 7
14 2022-01-09 8
15 2022-01-10 8
16 2022-01-11 10
17 2022-01-12 14
18 2022-01-13 14
19 2022-01-14 15

Interest by region - Google Trends¶

In [6]:
pd.read_csv('data/trends/regional.csv').head(20)
Out[6]:
geoName geoCode Wordle
0 Afghanistan AF 0
1 Albania AL 2
2 Algeria DZ 0
3 American Samoa AS 0
4 Andorra AD 17
5 Angola AO 0
6 Anguilla AI 0
7 Antarctica AQ 0
8 Antigua & Barbuda AG 15
9 Argentina AR 4
10 Armenia AM 1
11 Aruba AW 24
12 Australia AU 55
13 Austria AT 5
14 Azerbaijan AZ 1
15 Bahamas BS 14
16 Bahrain BH 8
17 Bangladesh BD 2
18 Barbados BB 17
19 Belarus BY 0

Wordle words¶

Refer to appendix section A.3.1

In [7]:
pd.read_csv('data/words/answers.csv', index_col='n').head(20)
Out[7]:
answers date
n
179 trace 2021-12-15
180 using 2021-12-16
181 peach 2021-12-17
182 champ 2021-12-18
183 baton 2021-12-19
184 brake 2021-12-20
185 pluck 2021-12-21
186 craze 2021-12-22
187 gripe 2021-12-23
188 weary 2021-12-24
189 picky 2021-12-25
190 acute 2021-12-26
191 ferry 2021-12-27
192 aside 2021-12-28
193 tapir 2021-12-29
194 troll 2021-12-30
195 unify 2021-12-31
196 rebus 2022-01-01
197 boost 2022-01-02
198 truss 2022-01-03

Wordle answer word frequency¶

Refer to appendix section A.3.2

In [8]:
pd.read_csv('data/words/freq.csv', index_col=0).head(20)
Out[8]:
answers freq zipf
n
179 trace 1.550000e-05 4.19
180 using 2.950000e-04 5.47
181 peach 5.890000e-06 3.77
182 champ 7.940000e-06 3.90
183 baton 4.370000e-06 3.64
184 brake 8.910000e-06 3.95
185 pluck 1.350000e-06 3.13
186 craze 2.240000e-06 3.35
187 gripe 6.460000e-07 2.81
188 weary 4.070000e-06 3.61
189 picky 2.190000e-06 3.34
190 acute 1.000000e-05 4.00
191 ferry 1.050000e-05 4.02
192 aside 3.890000e-05 4.59
193 tapir 1.950000e-07 2.29
194 troll 5.370000e-06 3.73
195 unify 1.170000e-06 3.07
196 rebus 3.390000e-07 2.53
197 boost 2.570000e-05 4.41
198 truss 1.950000e-06 3.29

ISO 3166-1 - Wikipedia¶

Refer to appendix section A.2.1

https://en.wikipedia.org/wiki/ISO_3166-1

In [9]:
pd.read_csv('data/geo/iso.csv', index_col=0).head(20)
Out[9]:
alpha2 alpha3
numeric
4 AF AFG
248 AX ALA
8 AL ALB
12 DZ DZA
16 AS ASM
20 AD AND
24 AO AGO
660 AI AIA
10 AQ ATA
28 AG ATG
32 AR ARG
51 AM ARM
533 AW ABW
36 AU AUS
40 AT AUT
31 AZ AZE
44 BS BHS
48 BH BHR
50 BD BGD
52 BB BRB

List of alternative country names - Wikipedia¶

Refer to appendix section A.2.2

https://en.wikipedia.org/wiki/List_of_alternative_country_names

In [10]:
pd.read_csv('data/geo/countries.csv').head(20)
Out[10]:
numeric name
0 4 Afghanistan
1 24 Angola
2 24 Republic of Angola
3 24 República de Angola
4 8 Albania
5 8 Republic of Albania
6 8 Republika e Shqipërisë
7 20 Andorra
8 20 Principality of Andorra
9 20 Principat d'Andorra
10 784 Al Emirat al Arabbiya al Muttahida
11 784 The Emirates
12 784 Trucial Arabia
13 784 United Arab Emirates
14 32 Argentina
15 32 Argentine Nation
16 32 Argentine Republic
17 32 la Argentina
18 51 Armenia
19 51 Hayastani Hanrapetut’yun

Country flag emojis¶

Refer to appendix section A.2.3

In [11]:
pd.read_csv('data/geo/flags.csv').head(20)
Out[11]:
numeric flag
0 4 🇦🇫
1 248 🇦🇽
2 8 🇦🇱
3 12 🇩🇿
4 16 🇦🇸
5 20 🇦🇩
6 24 🇦🇴
7 660 🇦🇮
8 10 🇦🇶
9 28 🇦🇬
10 32 🇦🇷
11 51 🇦🇲
12 533 🇦🇼
13 36 🇦🇺
14 40 🇦🇹
15 31 🇦🇿
16 44 🇧🇸
17 48 🇧🇭
18 50 🇧🇩
19 52 🇧🇧

Number of internet users by country - Our World in Data¶

https://ourworldindata.org/grapher/number-of-internet-users-by-country

In [12]:
pd.read_csv('data/geo/internet_users.csv').head(20)
Out[12]:
Entity Code Year Number of internet users (OWID based on WB & UN)
0 Afghanistan AFG 1990 0
1 Afghanistan AFG 2001 990
2 Afghanistan AFG 2002 1003
3 Afghanistan AFG 2003 20272
4 Afghanistan AFG 2004 25520
5 Afghanistan AFG 2005 306904
6 Afghanistan AFG 2006 545607
7 Afghanistan AFG 2007 505719
8 Afghanistan AFG 2008 502210
9 Afghanistan AFG 2009 994154
10 Afghanistan AFG 2010 1152127
11 Afghanistan AFG 2011 1485430
12 Afghanistan AFG 2012 1674380
13 Afghanistan AFG 2013 1872170
14 Afghanistan AFG 2014 2293061
15 Afghanistan AFG 2015 2786634
16 Afghanistan AFG 2016 3672058
17 Albania ALB 1990 0
18 Albania ALB 1995 347
19 Albania ALB 1996 996

Major cities of the world - Datahub¶

https://datahub.io/core/world-cities

In [13]:
pd.read_csv('data/geo/subcountries.csv').head(20)
Out[13]:
name country subcountry geonameid
0 les Escaldes Andorra Escaldes-Engordany 3040051
1 Andorra la Vella Andorra Andorra la Vella 3041563
2 Umm al Qaywayn United Arab Emirates Umm al Qaywayn 290594
3 Ras al-Khaimah United Arab Emirates Raʼs al Khaymah 291074
4 Khawr Fakkān United Arab Emirates Ash Shāriqah 291696
5 Dubai United Arab Emirates Dubai 292223
6 Dibba Al-Fujairah United Arab Emirates Al Fujayrah 292231
7 Dibba Al-Hisn United Arab Emirates Al Fujayrah 292239
8 Sharjah United Arab Emirates Ash Shāriqah 292672
9 Ar Ruways United Arab Emirates Abu Dhabi 292688
10 Al Fujayrah United Arab Emirates Al Fujayrah 292878
11 Al Ain United Arab Emirates Abu Dhabi 292913
12 Ajman United Arab Emirates Ajman 292932
13 Adh Dhayd United Arab Emirates Ash Shāriqah 292953
14 Abu Dhabi United Arab Emirates Abu Dhabi 292968
15 Zaranj Afghanistan Nīmrūz 1120985
16 Taloqan Afghanistan Takhār 1123004
17 Shīnḏanḏ Afghanistan Herat 1125155
18 Shibirghān Afghanistan Jowzjān 1125444
19 Shahrak Afghanistan Ghowr 1125896

World City Populations 2022 - World Population Review¶

https://worldpopulationreview.com/world-cities

In [14]:
pd.read_csv('data/geo/cities.csv').head(20)
Out[14]:
rank Name Country Population Prev Growth
0 1 Tokyo Japan 37274000 37339804 -0.0018
1 2 Delhi India 32065760 31181376 0.0284
2 3 Shanghai China 28516904 27795702 0.0259
3 4 Dhaka Bangladesh 22478116 21741090 0.0339
4 5 Sao Paulo Brazil 22429800 22237472 0.0086
5 6 Mexico City Mexico 22085140 21918936 0.0076
6 7 Cairo Egypt 21750020 21322750 0.0200
7 8 Beijing China 21333332 20896820 0.0209
8 9 Mumbai India 20961472 20667656 0.0142
9 10 Osaka Japan 19059856 19110616 -0.0027
10 11 Chongqing China 16874740 16382376 0.0301
11 12 Karachi Pakistan 16839950 16459472 0.0231
12 13 Istanbul Turkey 15636243 15415197 0.0143
13 14 Kinshasa DR Congo 15628085 14970460 0.0439
14 15 Lagos Nigeria 15387639 14862111 0.0354
15 16 Buenos Aires Argentina 15369919 15257673 0.0074
16 17 Kolkata India 15133888 14974073 0.0107
17 18 Manila Philippines 14406059 14158573 0.0175
18 19 Tianjin China 14011828 13794450 0.0158
19 20 Guangzhou China 13964637 13635397 0.0241

List of State Abbreviations - World Population Review¶

https://worldpopulationreview.com/states/state-abbreviations

In [15]:
pd.read_csv('data/geo/states.csv').head(20)
Out[15]:
State Abbrev Code
0 Alabama Ala. AL
1 Alaska Alaska AK
2 Arizona Ariz. AZ
3 Arkansas Ark. AR
4 California Calif. CA
5 Colorado Colo. CO
6 Connecticut Conn. CT
7 Delaware Del. DE
8 District of Columbia D.C. DC
9 Florida Fla. FL
10 Georgia Ga. GA
11 Hawaii Hawaii HI
12 Idaho Idaho ID
13 Illinois Ill. IL
14 Indiana Ind. IN
15 Iowa Iowa IA
16 Kansas Kans. KS
17 Kentucky Ky. KY
18 Louisiana La. LA
19 Maine Maine ME

Data Cleaning¶

In [16]:
import re

Tweets¶

When a game of wordle is completed, the player is able to easily share their results by copying a grid of emoji squares generated by the game. This allows players to share their game with others without spoiling it. These cryptic emoji squares have been shared all over the internet, especially on Twitter, as seen in the tweets below.

Wordle 420 X/6

⬛🟧⬛⬛⬛
⬛🟧⬛⬛⬛
⬛🟧⬛🟦🟧
🟧🟧⬛⬛🟧
🟧🟧⬛⬛🟧
🟧🟧⬛🟧🟧

this was so rude

— 🎃Hogueus Woahcus 🎃 (@HogueLikeWoah) August 14, 2022

Again with the rhyming game. I lucked out this time!#Wordle420 4/6*

⬛⬛⬛🟨⬛
⬛⬛🟩⬛🟩
⬛🟩🟩🟩🟩
🟩🟩🟩🟩🟩 pic.twitter.com/cySnOsjYSg

— Gayla (@AZGayla) August 13, 2022

The shared text follows a standard structure, with a header displaying the game number and score, followed by the emoji grid. Each row represents a guess made by the user and the emojis show how good the guess was. The header also includes an asterisk when playing on hard mode, and the emoji colours depend on whether the user is on dark or light theme, and whether they have high contrast mode on.

In [17]:
# process 1 day of tweets and write to file
def process_twitter(n):
    posts = read_csv(f'data/twitter/raw/wordles{n}.csv')# Load data
    
    # regex to match shared wordle games
    regex = ''.join([
        # match header
        rf'Wordle[^\n]*(?P<wordle>{n})[^\n]+(?P<score>[0-6X])\/6(?P<hard>\*)?',
        # (not) looking for links as many wordle "clones" link to their website
        r'(?![^🟥🟧🟨🟩🟦🟪⬛⬜🟫]*wordle[\w\-\.]*\.[a-z]{2,})',
        r'[^🟥🟧🟨🟩🟦🟪⬛⬜🟫]*',
        # emoji square grid
        r'(?P<grid>(?:[^🟥🟧🟨🟩🟦🟪⬛⬜🟫\n]*[🟨🟩🟧🟦⬛⬜]{5}[^🟥🟧🟨🟩🟦🟪⬛⬜🟫\n]*\n?){1,6})',
        r'(?!\n?[🟥🟧🟨🟩🟦🟪⬛⬜🟫])',
        # (not) looking for links
        r'(?![^🟥🟧🟨🟩🟦🟪⬛⬜🟫\/]*wordle[\w\-\.]*\.[a-z]{2,})'
    ])
    
    # remove non-standard characters to allow proper matching
    # such as invisible characters like U+FE0F "VARIATION SELECTOR-16" 
    # which often follow some emojis
    scores = posts['content'].str.replace(
        r'[^\w\s\n\/\.\-\*🟥🟧🟨🟩🟦🟪⬛⬜🟫]','', regex=True
    ).str.extract(regex, flags=re.IGNORECASE) # run regex extraction
    
    # return if their are no matches in the tweets
    if scores.isna().all().all():
        return

    # copy over data from raw posts data to the regex extracted data
    scores[['location','source']] = posts[['userLocation','sourceLabel']].astype('string')
    scores['username'] = posts['username'].astype('string')
    
    # remove non matches
    scores = scores.dropna(subset='wordle').reset_index(drop=True)

    # type conversion to save memory
    scores['wordle'] = scores['wordle'].astype('int16')
    scores['score'] = scores['score'].replace(['x','X'],0).astype('int8')
    scores['hard'] = scores['hard'] == '*'
    
    scores['dark'] = ~scores['grid'].str.contains('⬜')
    scores['contrast'] = ~scores['grid'].str.contains('🟩')
    
    grid = scores[scores['grid'].notnull()]['grid'].str.split(
        '[^🟥🟧🟨🟩🟦🟪⬛⬜🟫]+', expand=True,n=6 # separate each row of emojis
        ).replace('',pd.NA).reindex( # Convert all blank values to <NA>
            columns = range(0,6)
        ).astype('string')
    
    # converting the emoji rows to base3 number to reduce memory
    emojis = {
    '⬛':0, '⬜':0, # absent letter
    '🟦':1, '🟨':1, # present letter
    '🟧':2, '🟩':2  # correct letter
    }

    # convert emojis to base3
    def encodeGuess(guess):
        ans = 0
        if(pd.notna(guess)):
            for n in map(emojis.get, guess):
                ans = ans * 3 + n
            ans += 1
        return ans

    # iterate through each column
    def encodeColumn(col):
        col[col.str.contains('🟩').fillna(False)].str.replace(
            '🟧','🟨',regex=False # replace non standard emoji use
        )
        return col.apply(encodeGuess)
    
    grid = grid.apply(encodeColumn).astype('uint8')
    
    scores = pd.concat([scores, grid], axis=1).drop(columns='grid')
    # stop pyarrow from complaining about mixed type column names
    scores.columns = scores.columns.astype(str)
    
    # save to csv
    write_csv(scores, f'data/twitter/processed/wordles{n}.csv')

Location data¶

ISO codes¶

In [18]:
geo_iso = pd.read_csv('data/geo/iso.csv', index_col=0)
geo_iso.head(20)
Out[18]:
alpha2 alpha3
numeric
4 AF AFG
248 AX ALA
8 AL ALB
12 DZ DZA
16 AS ASM
20 AD AND
24 AO AGO
660 AI AIA
10 AQ ATA
28 AG ATG
32 AR ARG
51 AM ARM
533 AW ABW
36 AU AUS
40 AT AUT
31 AZ AZE
44 BS BHS
48 BH BHR
50 BD BGD
52 BB BRB

Internet users by country¶

In [19]:
geo_internet = pd.read_csv('data/geo/internet_users.csv')
# Reduce dataset to 1 value per country
geo_internet = geo_internet.groupby('Code').max()[
    ['Entity','Number of internet users (OWID based on WB & UN)']
]
# Convert alpha3 code to numeric code
geo_internet = pd.merge(
    geo_iso['alpha3'].reset_index(), 
    geo_internet, 
    left_on='alpha3', 
    right_index=True, 
    how='right'
).drop('alpha3', axis=1).reset_index(drop=True)
# Rename columns
geo_internet.columns = ['numeric','country', 'users']
In [20]:
geo_internet.info()
geo_internet[geo_internet['numeric'].isnull()]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   numeric  191 non-null    float64
 1   country  192 non-null    object 
 2   users    192 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 4.6+ KB
Out[20]:
numeric country users
133 NaN World 3419398061
In [21]:
# Drop 'World' as it is not a country
geo_internet.dropna(inplace=True)
# Downcast the numeric code (from float64) to int16
geo_internet['numeric'] = pd.to_numeric(geo_internet['numeric'], downcast='integer')
geo_internet.info()
geo_internet.head(20)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 191
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   numeric  191 non-null    int16 
 1   country  191 non-null    object
 2   users    191 non-null    int64 
dtypes: int16(1), int64(1), object(1)
memory usage: 4.8+ KB
Out[21]:
numeric country users
0 533 Aruba 98053
1 4 Afghanistan 3672058
2 24 Angola 3745750
3 8 Albania 1942025
4 784 United Arab Emirates 8913210
5 32 Argentina 31118074
6 51 Armenia 1882003
7 28 Antigua and Barbuda 73703
8 36 Australia 21288325
9 40 Austria 7681572
10 31 Azerbaijan 7763795
11 108 Burundi 544479
12 56 Belgium 10021205
13 204 Benin 1304029
14 854 Burkina Faso 2603042
15 50 Bangladesh 29733670
16 100 Bulgaria 4492334
17 48 Bahrain 1431062
18 44 Bahamas 312986
19 70 Bosnia and Herzegovina 2437067

Alternative country names¶

In [22]:
geo_countries = pd.read_csv('data/geo/countries.csv')
# Sort country names by number of internet users
geo_countries = geo_countries.merge(
    # Combine with geo_internet to do sort
    geo_internet[['numeric','users']], 
    on='numeric', 
    how='left'
).fillna(0).sort_values( # do sort
    'users', ascending=False
).drop('users', axis=1).reset_index(drop=True)
geo_countries.head(20)
Out[22]:
numeric name
0 156 中华人民共和国
1 156 Mainland China
2 156 Red China
3 156 People's Republic of China
4 156 PRC
5 156 中国內地
6 156 中国大陆
7 156 中国
8 156 New China
9 156 Communist China
10 156 China
11 156 新中国
12 156 中共
13 356 भारतवर्ष
14 356 ഭാരതം
15 356 भारत
16 356 Republic of India
17 356 India
18 356 பாரதம்
19 356 Bhāratam भारतम्

Country flag emojis¶

In [23]:
geo_flags = pd.read_csv('data/geo/flags.csv')
# Sort flag emojis by number of internet users
# same process as for country names
geo_flags = geo_flags.merge(
        geo_internet[['numeric','users']], 
        on='numeric', how='left'
    ).fillna(0).sort_values(
        'users', ascending=False
    ).drop(
        'users', axis=1
    ).reset_index(drop=True)
geo_flags.head(20)
Out[23]:
numeric flag
0 156 🇨🇳
1 356 🇮🇳
2 840 🇺🇸
3 76 🇧🇷
4 392 🇯🇵
5 643 🇷🇺
6 360 🇮🇩
7 484 🇲🇽
8 276 🇩🇪
9 826 🇬🇧
10 608 🇵🇭
11 250 🇫🇷
12 792 🇹🇷
13 364 🇮🇷
14 410 🇰🇷
15 566 🇳🇬
16 704 🇻🇳
17 818 🇪🇬
18 724 🇪🇸
19 764 🇹🇭

Subcountries of countries¶

In [24]:
geo_subcountries = pd.read_csv('data/geo/subcountries.csv').dropna()
geo_subcountries = pd.DataFrame(
        # Get all unique pairs of country and subcountry
        list(set(zip(
            geo_subcountries['country'],
            geo_subcountries['subcountry']
        ))), 
        columns=['country','subcountry']
    ).merge(
        # Get numeric code from geo_countries
        geo_countries, left_on='country', right_on='name'
    ).drop('name',axis=1)
geo_subcountries.head(20)
Out[24]:
country subcountry numeric
0 Russia Tula 643
1 Russia Kabardino-Balkariya 643
2 Russia Kamtsjatka 643
3 Russia Tatarstan 643
4 Russia Vologda 643
5 Russia Arkhangelskaya 643
6 Russia Brjansk 643
7 Russia Lipetsk 643
8 Russia Rostov 643
9 Russia Smolensk 643
10 Russia Nenetskiy Avtonomnyy Okrug 643
11 Russia Karachayevo-Cherkesiya 643
12 Russia Sverdlovsk 643
13 Russia Kursk 643
14 Russia Omsk 643
15 Russia Stavropol'skiy 643
16 Russia Respublika Buryatiya 643
17 Russia Jaroslavl 643
18 Russia Mariy-El 643
19 Russia Krasnodarskiy 643

Cities in countries¶

In [25]:
geo_cities = pd.read_csv('data/geo/cities.csv')
geo_cities.loc[geo_cities['Country']=='DR Congo', 'Country'] = 'Congo' # Special case
geo_cities = geo_cities.merge(
        # Get numeric code from geo_countries
        geo_countries, left_on='Country', right_on='name', how='inner'
    ).drop('name',axis=1).sort_values(
        # Sort by population size
        by='Population', ascending=False
    # drop NA values, such as for regions which are not countries
    ).reset_index(drop=True).dropna()
geo_cities.head(20)
Out[25]:
rank Name Country Population Prev Growth numeric
0 1 Tokyo Japan 37274000 37339804 -0.0018 392
1 2 Delhi India 32065760 31181376 0.0284 356
2 3 Shanghai China 28516904 27795702 0.0259 156
3 4 Dhaka Bangladesh 22478116 21741090 0.0339 50
4 5 Sao Paulo Brazil 22429800 22237472 0.0086 76
5 6 Mexico City Mexico 22085140 21918936 0.0076 484
6 7 Cairo Egypt 21750020 21322750 0.0200 818
7 8 Beijing China 21333332 20896820 0.0209 156
8 9 Mumbai India 20961472 20667656 0.0142 356
9 10 Osaka Japan 19059856 19110616 -0.0027 392
10 11 Chongqing China 16874740 16382376 0.0301 156
11 12 Karachi Pakistan 16839950 16459472 0.0231 586
12 13 Istanbul Turkey 15636243 15415197 0.0143 792
13 14 Kinshasa Congo 15628085 14970460 0.0439 178
14 14 Kinshasa Congo 15628085 14970460 0.0439 178
15 15 Lagos Nigeria 15387639 14862111 0.0354 566
16 16 Buenos Aires Argentina 15369919 15257673 0.0074 32
17 17 Kolkata India 15133888 14974073 0.0107 356
18 18 Manila Philippines 14406059 14158573 0.0175 608
19 19 Tianjin China 14011828 13794450 0.0158 156

US States¶

In [26]:
geo_states = pd.read_csv('data/geo/states.csv')
geo_states.head(20)
Out[26]:
State Abbrev Code
0 Alabama Ala. AL
1 Alaska Alaska AK
2 Arizona Ariz. AZ
3 Arkansas Ark. AR
4 California Calif. CA
5 Colorado Colo. CO
6 Connecticut Conn. CT
7 Delaware Del. DE
8 District of Columbia D.C. DC
9 Florida Fla. FL
10 Georgia Ga. GA
11 Hawaii Hawaii HI
12 Idaho Idaho ID
13 Illinois Ill. IL
14 Indiana Ind. IN
15 Iowa Iowa IA
16 Kansas Kans. KS
17 Kentucky Ky. KY
18 Louisiana La. LA
19 Maine Maine ME

Getting tweet locations¶

The location information of twitter users can be extracted by the location field of their profile. However, not all users opt to specify their location. In addition, as it is just a text field, not all users actually put their location, and it is much harder to accurately determine their location.

In [27]:
dtypes = pd.read_csv('data/twitter/processed/dtypes.csv', index_col=0).iloc[:,0].to_dict()
            
df_twitter = pyarrow.concat_tables([arcsv.read_csv(f'data/twitter/processed/wordles{n}.csv', 
                parse_options = arcsv.ParseOptions(newlines_in_values=True),
                convert_options = arcsv.ConvertOptions(column_types=dtypes)
            ) for n in range(190,440)]).to_pandas()
            
sources = pd.Series(df_twitter['source'].value_counts(sort=True).index)
df_twitter['source'] = pd.to_numeric(df_twitter['source'].map(pd.Series(sources.index, index=sources)), downcast='integer')
df_twitter.head(20)
Out[27]:
wordle score hard location source username dark contrast 0 1 2 3 4 5
0 190 4 False 1 cafehearts True False 91 33 166 243 0 0
1 190 4 False New Zealand 0 MJMcVeigh False False 40 112 220 243 0 0
2 190 4 False lenape territory / nyc 1 jmarieray False False 121 94 229 243 0 0
3 190 5 False Seaton Carew, England 4 Olucaron False False 91 41 121 229 243 0
4 191 2 False Wellington, New Zealand 1 BarristerNZ True False 27 243 0 0 0 0
5 191 5 False 💉💉💉 on Ngunnawal country 1 KBCanB False False 7 90 81 81 243 0
6 191 6 False New Zealand 0 MJMcVeigh False False 136 56 73 237 237 243
7 191 4 False 1 cafehearts True False 8 1 180 243 0 0
8 192 3 False Dallas, TX 1 KevinPayravi True False 12 165 243 0 0 0
9 192 4 False 방 의자 2 onezero42O False False 109 12 222 243 0 0
10 192 5 False Canberra 12 Chriplodocus True False 38 94 41 168 243 0
11 192 0 False 1 aquaebn False False 2 32 8 32 6 1
12 192 3 False Auckland Central, Auckland 3 kathyscott25 False False 94 53 243 0 0 0
13 192 5 False 1 alejandroechev_ False False 28 94 220 220 243 0
14 192 5 False Santiago, Chile 0 jpalz True False 32 13 109 220 243 0
15 192 4 False San Diego 2 badalj False False 10 47 21 243 0 0
16 192 3 False crossing a bridge 2 semirose False False 102 24 243 0 0 0
17 192 5 False australia • they/them • bi 0 eboyuchiha True False 93 193 194 189 243 0
18 192 4 False 065 067 066 1 CanDoMarxist True False 30 183 186 243 0 0
19 192 2 False Victoria, Australia 1 crazythainame False False 93 243 0 0 0 0
In [28]:
# Get all unique locations of twitter users
twt_locations = pd.DataFrame(df_twitter['location'].dropna().str.strip().unique(), columns=['location'])
twt_locations.head(20)
Out[28]:
location
0
1 New Zealand
2 lenape territory / nyc
3 Seaton Carew, England
4 Wellington, New Zealand
5 💉💉💉 on Ngunnawal country
6 Dallas, TX
7 방 의자
8 Canberra
9 Auckland Central, Auckland
10 Santiago, Chile
11 San Diego
12 crossing a bridge
13 australia • they/them • bi
14 065 067 066
15 Victoria, Australia
16 A couple of steps off the pace
17 Warrnambool, Victoria
18 Hobart, Tasmania
19 Melbourne, Australia
In [29]:
# combine all the dataframes used to match terms to countries
geo_names = pd.concat([
        geo_countries,
        # renaming columns for consistency & concat
        geo_subcountries.rename(columns={'subcountry':'name'})[['numeric','name']],
        geo_cities.rename(columns={'Name':'name'})[['numeric','name']]
    ])
geo_names['name'] = geo_names['name'].str.lower()
geo_names.head(20)
Out[29]:
numeric name
0 156 中华人民共和国
1 156 mainland china
2 156 red china
3 156 people's republic of china
4 156 prc
5 156 中国內地
6 156 中国大陆
7 156 中国
8 156 new china
9 156 communist china
10 156 china
11 156 新中国
12 156 中共
13 356 भारतवर्ष
14 356 ഭാരതം
15 356 भारत
16 356 republic of india
17 356 india
18 356 பாரதம்
19 356 bhāratam भारतम्
In [30]:
# putting this in a function as it can be quite slow
def run_location_matching():
    memo = {}

    # try to match location string to country numeric code
    def match_location(location):
        if location in memo: # should be redundant as values should be unique
            return memo[location]
        
        # try to match flags first
        for country, flag in zip(geo_flags['numeric'], geo_flags['flag']):
            if flag in location:
                memo[location] = country
                return country
        
        # match names/terms to country
        for country, name in zip(geo_names['numeric'], geo_names['name']):
            if name in location and re.search(rf'(^|[^a-z]){re.escape(name)}([^a-z]|$)', location, re.I):
                memo[location] = country
                return country
        
        # no match
        memo[location] = 0
        return 0
    
    # run matching for all the locations
    twt_locations['country'] = twt_locations['location'].str.lower().apply(match_location)
    
    # matching by US state codes
    # boolean mask for locations with unmatched country
    mask = twt_locations['country']==0
    unmatched = twt_locations.loc[
            mask,'location'
        ].str.replace(
            '.','', regex=False
        )
    # regex for matching any state code
    regex = "|".join(geo_states["Code"].apply(re.escape))
    # set country as USA if matches state code
    twt_locations.loc[
            mask & 
            unmatched.str.contains(
                rf'(?:^|[^a-zA-Z])(?:{regex})(?:[^a-zA-Z]|$)'
            ) &
            unmatched.str.contains(
                rf', ?(?:{regex})(?:[^a-z]|$)',
                case=False
            ),
            'country'
        ] = 840 # USA
    
    # map matched locations (twt_location) to each tweet in df_twitter
    df_twitter['location'] = df_twitter['location'].str.strip()
    df_twitter = pd.merge(
        df_twitter, twt_locations, on='location'
    ).drop('location', axis=1) # remove location field after matching
    
    # downcast to int
    df_twitter['country'] = pd.to_numeric(df_twitter['country'], downcast='integer')
    
    # probbaly redundant
    df_twitter['username'] = df_twitter['username'].astype('string')
    
    # write to csv with pyarrow
    arcsv.write_csv(Table.from_pandas(df_twitter, preserve_index=False), 'data/twitter/wordle.csv')
    # save data types
    df_twitter.dtypes.to_csv('data/twitter/dtypes.csv')

Reddit comments¶

Another platform where users share their results is Reddit. This is mainly done in the Daily Wordle threads in the r/wordle subreddit. Reddit's greater text formatting functionality allows users to also share their guess words using spoiler tags. >!SPOILER!< As such, the sharing format is similar to the standard format, but with the word in spoiler tags usually on the same line as each row of emojis.

In [31]:
reddit_raw = read_csv('data/reddit/comments.csv')
reddit_raw.info()
reddit_raw.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59614 entries, 0 to 59613
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       59614 non-null  object
 1   body         59614 non-null  object
 2   score        59614 non-null  int64 
 3   created_utc  59614 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.8+ MB
Out[31]:
author body score created_utc
0 lavinient Scoredle 439 4/6* \n\n14,855 \n⬛⬛🟨⬛⬛ >!SLIME... 26 1662010774
1 BlitzAceSamy Scoredle 439 3/6* \n\n|Guess|Result|Scoredle|... 11 1661943874
2 eliw23 Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜🟨⬜ >!STAIR... 9 1662007274
3 blackbeanqueen Scoredle 439 3/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!SLATE... 8 1662065267
4 MidnightExcursion Did you hear about the >!mushroom!< that got k... 7 1662019442
5 BlackJPI Scoredle 439 4/6* \n\n14,855 \n⬜⬜⬜⬜⬜ >!STEAM... 8 1662029435
6 cheezislife Scoredle 439 4/6* \n\n14,855 \n🟩⬛⬛⬛⬛ >!FLASH... 6 1662067855
7 HS007 Scoredle 439 6/6 \n\n14,855 \n* ⬛🟨⬛🟨⬛ >!PIOU... 5 1661972028
8 Pinnacle55 Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛🟨⬛ >!STONE... 5 1662020156
9 Inna_Bien >!words that end with i are a special kind of ... 5 1662022149
10 shombled Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛⬛⬛ >!STARE... 4 1662006454
11 LadeeAlana People say that I'm kind of a >!FUN GUY!< myself. 3 1662021326
12 TaurAnder Scoredle 439 3/6* \n\n14,855 \n* 🟩⬛🟨⬛⬛ >!FLI... 4 1662022421
13 cyrano4833 Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STORE... 3 1662028951
14 oglabradoodle Scoredle 439 4/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STEAM... 5 1662040052
15 a-little-bit-this Scoredle 439 4/6* \n\n14,855 \n* ⬜⬜⬜🟨⬜ >!ALO... 3 1662054311
16 mohamedzu Scoredle 439 3/6* \n\n14,855 \n🟩🟨🟨⬛⬛ >!FIGHT... 4 1662055968
17 KingD123 Scoredle 439 3/6* \n\n12,974 \n⬛⬛⬛🟨⬛ >!PLANK... 4 1662068789
18 soleilady Scoredle 439 4/6* \n\n12,974 \n⬜⬜🟨⬜⬜ >!ARISE... 3 1662072125
19 calscigal Scoredle 439 3/6* \n\n12,974 \n* ⬛⬛⬛⬛⬛ >!CHA... 4 1662075325
In [32]:
r_extract = reddit_raw['body'].str.extract(
    ''.join([
        # match header and extract wordle number, score, hard mode
        r'(?:Score|Wor)dle.*(?P<wordle>[1-4]\d\d).*(?P<score>[0-6X])/6(?P<hard>\*)?',
        # find the first row of the emoji and get the spoiler word in that line
        '[^🟥🟧🟨🟩🟦🟪⬛⬜🟫]*\n[^\n]*>!(?P<word>[a-zA-Z]{5}).*!<'])
)
r_extract.head(20)
Out[32]:
wordle score hard word
0 439 4 * SLIME
1 439 3 * LOSER
2 439 5 * STAIR
3 439 3 * SLATE
4 NaN NaN NaN NaN
5 439 4 * STEAM
6 439 4 * FLASH
7 439 6 NaN PIOUS
8 439 3 * STONE
9 NaN NaN NaN NaN
10 439 3 * STARE
11 NaN NaN NaN NaN
12 439 3 * FLIRT
13 439 5 * STORE
14 439 4 * STEAM
15 439 4 * ALONE
16 439 3 * FIGHT
17 439 3 * PLANK
18 439 4 * ARISE
19 439 3 * CHART
In [33]:
# remove non matches
r_scores = r_extract.dropna(subset='wordle').copy()
# convert dtypes for efficiency
r_scores['wordle'] = r_scores['wordle'].astype('int16')
r_scores['score'] = r_scores['score'].replace(['x','X'],0).astype('int8')
r_scores['word'] = r_scores['word'].str.lower().astype('string')
r_scores['hard'] = r_scores['hard'] == '*'
r_scores.head(20)
Out[33]:
wordle score hard word
0 439 4 True slime
1 439 3 True loser
2 439 5 True stair
3 439 3 True slate
5 439 4 True steam
6 439 4 True flash
7 439 6 False pious
8 439 3 True stone
10 439 3 True stare
12 439 3 True flirt
13 439 5 True store
14 439 4 True steam
15 439 4 True alone
16 439 3 True fight
17 439 3 True plank
18 439 4 True arise
19 439 3 True chart
20 439 4 False avoid
21 439 3 True raise
22 439 3 True rugby

Word data¶

Data related to Wordle answers and guesses. Most do not require cleaning as they were scraped into proper formats already.

In [34]:
answers = pd.read_csv('data/words/answers.csv', index_col='n')
answers.head(20)
Out[34]:
answers date
n
179 trace 2021-12-15
180 using 2021-12-16
181 peach 2021-12-17
182 champ 2021-12-18
183 baton 2021-12-19
184 brake 2021-12-20
185 pluck 2021-12-21
186 craze 2021-12-22
187 gripe 2021-12-23
188 weary 2021-12-24
189 picky 2021-12-25
190 acute 2021-12-26
191 ferry 2021-12-27
192 aside 2021-12-28
193 tapir 2021-12-29
194 troll 2021-12-30
195 unify 2021-12-31
196 rebus 2022-01-01
197 boost 2022-01-02
198 truss 2022-01-03
In [35]:
freq = pd.read_csv('data/words/freq.csv')
freq.head(20)
Out[35]:
n answers freq zipf
0 179 trace 1.550000e-05 4.19
1 180 using 2.950000e-04 5.47
2 181 peach 5.890000e-06 3.77
3 182 champ 7.940000e-06 3.90
4 183 baton 4.370000e-06 3.64
5 184 brake 8.910000e-06 3.95
6 185 pluck 1.350000e-06 3.13
7 186 craze 2.240000e-06 3.35
8 187 gripe 6.460000e-07 2.81
9 188 weary 4.070000e-06 3.61
10 189 picky 2.190000e-06 3.34
11 190 acute 1.000000e-05 4.00
12 191 ferry 1.050000e-05 4.02
13 192 aside 3.890000e-05 4.59
14 193 tapir 1.950000e-07 2.29
15 194 troll 5.370000e-06 3.73
16 195 unify 1.170000e-06 3.07
17 196 rebus 3.390000e-07 2.53
18 197 boost 2.570000e-05 4.41
19 198 truss 1.950000e-06 3.29

Google Trends¶

In [36]:
trends = pd.read_csv('data/trends/trends.csv')
trends.head(20)
Out[36]:
date Wordle
0 2021-12-26 0
1 2021-12-27 0
2 2021-12-28 0
3 2021-12-29 0
4 2021-12-30 0
5 2021-12-31 1
6 2022-01-01 1
7 2022-01-02 1
8 2022-01-03 1
9 2022-01-04 4
10 2022-01-05 3
11 2022-01-06 4
12 2022-01-07 6
13 2022-01-08 7
14 2022-01-09 8
15 2022-01-10 8
16 2022-01-11 10
17 2022-01-12 14
18 2022-01-13 14
19 2022-01-14 15
In [37]:
regional = pd.read_csv('data/trends/regional.csv')
regional = regional.merge(geo_iso.reset_index(), left_on='geoCode', right_on='alpha2')
regional.head(20)
Out[37]:
geoName geoCode Wordle numeric alpha2 alpha3
0 Afghanistan AF 0 4 AF AFG
1 Albania AL 2 8 AL ALB
2 Algeria DZ 0 12 DZ DZA
3 American Samoa AS 0 16 AS ASM
4 Andorra AD 17 20 AD AND
5 Angola AO 0 24 AO AGO
6 Anguilla AI 0 660 AI AIA
7 Antarctica AQ 0 10 AQ ATA
8 Antigua & Barbuda AG 15 28 AG ATG
9 Argentina AR 4 32 AR ARG
10 Armenia AM 1 51 AM ARM
11 Aruba AW 24 533 AW ABW
12 Australia AU 55 36 AU AUS
13 Austria AT 5 40 AT AUT
14 Azerbaijan AZ 1 31 AZ AZE
15 Bahamas BS 14 44 BS BHS
16 Bahrain BH 8 48 BH BHR
17 Bangladesh BD 2 50 BD BGD
18 Barbados BB 17 52 BB BRB
19 Belarus BY 0 112 BY BLR

EDA¶

In [38]:
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from wordcloud import WordCloud
In [39]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'notebook+vscode'
In [40]:
import statsmodels.api as sm
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
In [41]:
dtypes = read_csv('data/twitter/dtypes.csv').set_index('')['0'].to_dict()
df_twitter = read_csv('data/twitter/wordle.csv', dtype=dtypes)
df_twitter.info(show_counts=True)
df_twitter.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26339556 entries, 0 to 26339555
Data columns (total 14 columns):
 #   Column    Non-Null Count     Dtype 
---  ------    --------------     ----- 
 0   wordle    26339556 non-null  int16 
 1   score     26339556 non-null  int8  
 2   hard      26339556 non-null  bool  
 3   source    26339556 non-null  int16 
 4   username  26339556 non-null  object
 5   dark      26339556 non-null  bool  
 6   contrast  26339556 non-null  bool  
 7   0         26339556 non-null  uint8 
 8   1         26339556 non-null  uint8 
 9   2         26339556 non-null  uint8 
 10  3         26339556 non-null  uint8 
 11  4         26339556 non-null  uint8 
 12  5         26339556 non-null  uint8 
 13  country   26339556 non-null  int16 
dtypes: bool(3), int16(3), int8(1), object(1), uint8(6)
memory usage: 602.9+ MB
Out[41]:
wordle score hard source username dark contrast 0 1 2 3 4 5 country
0 190 4 False 1 cafehearts True False 91 33 166 243 0 0 0
1 191 4 False 1 cafehearts True False 8 1 180 243 0 0 0
2 192 0 False 1 aquaebn False False 2 32 8 32 6 1 0
3 192 5 False 1 alejandroechev_ False False 28 94 220 220 243 0 0
4 192 4 False 2 W1LL20 True False 37 2 113 243 0 0 0
5 192 4 False 1 swansraiders False False 1 95 113 243 0 0 0
6 192 3 False 0 tezzamezza28 True False 172 221 243 0 0 0 0
7 192 3 False 1 t8tweet False False 32 113 243 0 0 0 0
8 192 3 False 1 juicyfruitygirl False False 22 47 243 0 0 0 0
9 192 2 False 1 Waterslug3 False False 44 243 0 0 0 0 0
10 192 3 False 1 Knightsmad1 False False 109 99 243 0 0 0 0
11 192 2 False 1 TheRobboBee True False 211 243 0 0 0 0 0
12 192 4 False 1 WemoBemo False False 31 12 168 243 0 0 0
13 192 3 False 1 billypilgrym False False 47 186 243 0 0 0 0
14 192 3 False 1 BTohiariki True False 166 188 243 0 0 0 0
15 192 3 False 0 GrandOldMark False False 94 51 243 0 0 0 0
16 192 4 False 0 AWMcLennan False False 43 44 99 243 0 0 0
17 192 4 False 1 judew5238 False False 30 12 168 243 0 0 0
18 192 3 False 0 electomagneticJ False False 33 222 243 0 0 0 0
19 192 4 False 3 nanbutty False False 40 94 41 243 0 0 0
In [42]:
# remove tweets where the score does not match the emoji grid
df_twitter = df_twitter[
    (df_twitter[[str(n) for n in range(0,6)]] > 0).sum(axis=1) == df_twitter['score'].replace(0,6)
].reset_index(drop=True)
df_twitter.head(20)
Out[42]:
wordle score hard source username dark contrast 0 1 2 3 4 5 country
0 190 4 False 1 cafehearts True False 91 33 166 243 0 0 0
1 191 4 False 1 cafehearts True False 8 1 180 243 0 0 0
2 192 0 False 1 aquaebn False False 2 32 8 32 6 1 0
3 192 5 False 1 alejandroechev_ False False 28 94 220 220 243 0 0
4 192 4 False 2 W1LL20 True False 37 2 113 243 0 0 0
5 192 4 False 1 swansraiders False False 1 95 113 243 0 0 0
6 192 3 False 0 tezzamezza28 True False 172 221 243 0 0 0 0
7 192 3 False 1 t8tweet False False 32 113 243 0 0 0 0
8 192 3 False 1 juicyfruitygirl False False 22 47 243 0 0 0 0
9 192 2 False 1 Waterslug3 False False 44 243 0 0 0 0 0
10 192 3 False 1 Knightsmad1 False False 109 99 243 0 0 0 0
11 192 2 False 1 TheRobboBee True False 211 243 0 0 0 0 0
12 192 4 False 1 WemoBemo False False 31 12 168 243 0 0 0
13 192 3 False 1 billypilgrym False False 47 186 243 0 0 0 0
14 192 3 False 1 BTohiariki True False 166 188 243 0 0 0 0
15 192 3 False 0 GrandOldMark False False 94 51 243 0 0 0 0
16 192 4 False 0 AWMcLennan False False 43 44 99 243 0 0 0
17 192 4 False 1 judew5238 False False 30 12 168 243 0 0 0
18 192 3 False 0 electomagneticJ False False 33 222 243 0 0 0 0
19 192 4 False 3 nanbutty False False 40 94 41 243 0 0 0
In [43]:
# number of tweets from each day
twt_counts = df_twitter['wordle'].value_counts().sort_index()
twt_counts.head(20)
Out[43]:
190         4
191         4
192      2367
193      8749
194      9681
195     10545
196     11882
197     17729
198     23976
199     26282
200     49548
201     52132
202     75490
203     90727
204     81599
205     94473
206    135818
207    121337
208    115906
209    144076
Name: wordle, dtype: int64

Popularity of Wordle¶

In [44]:
index = pd.to_datetime(answers.loc[twt_counts.index,'date'])
fig = px.scatter(
    x=index,
    y=twt_counts,
    trendline="lowess", trendline_options=dict(frac=0.06)
)
fig.update_layout(
    title_text='Number of game results posted on Twitter for each Wordle',
    hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
    title='Number of Tweets',
    fixedrange=False
)
fig.update_xaxes(
    title='Date',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto',
)
fig.show()
In [45]:
fig = px.scatter(
    x=index,
    y=trends['Wordle'],
    trendline="lowess", trendline_options=dict(frac=0.08)
)
fig.update_layout(
    title_text='Google Trends search volume of Wordle',
    hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
    title='Relative search volume',
    fixedrange=False
)
fig.update_xaxes(
    title='Date',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto',
)
fig.show()
In [46]:
# All users who have posted wordle and the first wordle they posted
twt_users = df_twitter.drop_duplicates(subset='username')[['wordle','username']]
twt_users.head(20)
Out[46]:
wordle username
0 190 cafehearts
2 192 aquaebn
3 192 alejandroechev_
4 192 W1LL20
5 192 swansraiders
6 192 tezzamezza28
7 192 t8tweet
8 192 juicyfruitygirl
9 192 Waterslug3
10 192 Knightsmad1
11 192 TheRobboBee
12 192 WemoBemo
13 192 billypilgrym
14 192 BTohiariki
15 192 GrandOldMark
16 192 AWMcLennan
17 192 judew5238
18 192 electomagneticJ
19 192 nanbutty
20 192 99FascinationSt
In [47]:
# Total number of users post for each day
twt_users_all = df_twitter.groupby('wordle')['username'].unique().str.len().to_frame(name='total')
# New users (based on first wordle)
twt_users_all['new'] = twt_users['wordle'].value_counts().sort_index()
# Users who are not new
twt_users_all['repeat'] = twt_users_all['total'] - twt_users_all['new']
twt_users_all.head(20)
Out[47]:
total new repeat
wordle
190 4 4 0
191 4 2 2
192 2353 2349 4
193 8663 7160 1503
194 9601 4767 4834
195 10457 3699 6758
196 11771 4205 7566
197 17562 7501 10061
198 23784 10285 13499
199 26094 11276 14818
200 49093 25870 23223
201 51671 21131 30540
202 74632 32080 42552
203 89638 35373 54265
204 80846 22851 57995
205 93668 29749 63919
206 134375 45261 89114
207 120150 30509 89641
208 114920 24858 90062
209 142772 38171 104601
In [48]:
fig = px.area(
    twt_users_all, 
    x=twt_users_all.index, 
    y=['new', 'repeat'],
    labels={'variable':'Users'},
    title='Number of new and repeat users posting Wordle scores over time'
)
fig.update_yaxes(title='Number of users',fixedrange=False)
fig.update_xaxes(
    title='Wordle',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(hovertemplate=None)
fig.update_layout(hovermode='x unified')
fig.show()
In [49]:
fig = px.area(
    twt_users_all.divide(twt_users_all['total'], axis=0), 
    x=twt_users_all.index, 
    y=['new', 'repeat'],
    labels={'variable':'Users'},
    title='Proportion of new and repeat users posting Wordle scores over time',
    range_y=[0,1]
)
fig.update_yaxes(title='Proportion of users',fixedrange=False)
fig.update_xaxes(
    title='Wordle',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(hovertemplate=None)
fig.update_layout(hovermode='x unified')
fig.show()

Countries¶

In [50]:
country_users = df_twitter.groupby('country')['username'].unique().str.len().drop(0).rename('count')
country_users = pd.merge(country_users, geo_iso, left_index=True, right_index=True)
country_users['log'] = np.log10(country_users['count'])
country_users.head(20)
Out[50]:
count alpha2 alpha3 log
4 53 AF AFG 1.724276
8 53 AL ALB 1.724276
10 3 AQ ATA 0.477121
12 119 DZ DZA 2.075547
16 3 AS ASM 0.477121
20 28 AD AND 1.447158
24 29 AO AGO 1.462398
28 78 AG ATG 1.892095
31 91 AZ AZE 1.959041
32 3299 AR ARG 3.518382
36 23928 AU AUS 4.378906
40 1395 AT AUT 3.144574
44 52 BS BHS 1.716003
48 334 BH BHR 2.523746
50 548 BD BGD 2.738781
51 71 AM ARM 1.851258
52 287 BB BRB 2.457882
56 1782 BE BEL 3.250908
60 8 BM BMU 0.903090
64 40 BT BTN 1.602060
In [51]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','log'],
    data=country_users,
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Log10 of the number of users'
).add_to(m)
m
Out[51]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [52]:
geo_internet[geo_internet['users']==0]
Out[52]:
numeric country users
141 408 North Korea 0
In [53]:
country_relative = country_users.merge(geo_internet[geo_internet['users']>0], left_index=True, right_on='numeric')
country_relative['relative'] = country_relative['count'] / country_relative['users']
country_relative.head(20)
Out[53]:
count alpha2 alpha3 log numeric country users relative
1 53 AF AFG 1.724276 4 Afghanistan 3672058 0.000014
3 53 AL ALB 1.724276 8 Albania 1942025 0.000027
49 119 DZ DZA 2.075547 12 Algeria 17438483 0.000007
2 29 AO AGO 1.462398 24 Angola 3745750 0.000008
7 78 AG ATG 1.892095 28 Antigua and Barbuda 73703 0.001058
10 91 AZ AZE 1.959041 31 Azerbaijan 7763795 0.000012
5 3299 AR ARG 3.518382 32 Argentina 31118074 0.000106
8 23928 AU AUS 4.378906 36 Australia 21288325 0.001124
9 1395 AT AUT 3.144574 40 Austria 7681572 0.000182
18 52 BS BHS 1.716003 44 Bahamas 312986 0.000166
17 334 BH BHR 2.523746 48 Bahrain 1431062 0.000233
15 548 BD BGD 2.738781 50 Bangladesh 29733670 0.000018
6 71 AM ARM 1.851258 51 Armenia 1882003 0.000038
24 287 BB BRB 2.457882 52 Barbados 226713 0.001266
12 1782 BE BEL 3.250908 56 Belgium 10021205 0.000178
26 40 BT BTN 1.602060 64 Bhutan 333248 0.000120
22 51 BO BOL 1.707570 68 Bolivia 4322219 0.000012
19 72 BA BIH 1.857332 70 Bosnia and Herzegovina 2437067 0.000030
27 2004 BW BWA 3.301898 72 Botswana 885770 0.002262
23 8949 BR BRA 3.951775 76 Brazil 126403573 0.000071
In [54]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','relative'],
    data=country_relative[
        # remove outliers
        country_relative['relative'] < country_relative['relative'].quantile(0.999)
    ],
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Proportion of internet users in country'
).add_to(m)
m
Out[54]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [55]:
country_counts = np.log10(df_twitter['country'].value_counts().drop(0).rename('count'))
country_counts = pd.merge(country_counts, geo_iso, left_index=True, right_index=True)
country_counts.head(20)
Out[55]:
count alpha2 alpha3
840 6.724513 US USA
826 6.272636 GB GBR
124 5.897656 CA CAN
356 5.793000 IN IND
36 5.648072 AU AUS
372 5.395740 IE IRL
392 5.373050 JP JPN
608 5.189496 PH PHL
554 5.113823 NZ NZL
710 5.072467 ZA ZAF
276 5.037128 DE DEU
458 5.021557 MY MYS
120 4.993912 CM CMR
360 4.979453 ID IDN
288 4.900236 GH GHA
76 4.868850 BR BRA
388 4.864879 JM JAM
643 4.851754 RU RUS
724 4.847029 ES ESP
706 4.782766 SO SOM
In [56]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','count'],
    data=country_counts,
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Log10 of the number of tweets'
).add_to(m)
m
Out[56]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Google Trends Search Interest by Region¶

In [57]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','Wordle'],
    data=regional,
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Search interest',
).add_to(m)
m
Out[57]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Average scores¶

In [58]:
twt_mean = df_twitter[df_twitter['score']>0].groupby('wordle')['score'].mean().rename('mean').to_frame()
twt_mean['answers'] = answers.loc[twt_mean.index,'answers']
twt_mean.head(20)
Out[58]:
mean answers
wordle
190 4.250000 acute
191 4.250000 ferry
192 3.816205 aside
193 3.530411 tapir
194 3.752526 troll
195 4.327278 unify
196 4.391588 rebus
197 3.989553 boost
198 4.023979 truss
199 4.458242 siege
200 3.921375 tiger
201 4.335495 banal
202 4.078716 slump
203 4.167350 crank
204 4.558212 gorge
205 4.373089 query
206 3.684007 drink
207 4.440339 favor
208 4.525946 abbey
209 4.188266 tangy
In [59]:
scores_count = df_twitter.groupby('wordle')['score'].value_counts()
scores_count.name = 'count'
scores_count = scores_count.reset_index()
scores_count['score'] = scores_count['score'].replace(0,'X').astype('string')
scores_count = scores_count.merge(scores_count.groupby('wordle')['count'].sum().rename('total'), left_on='wordle', right_index=True)
scores_count['relative'] = scores_count['count'] / scores_count['total'] * 100
scores_count.head(20)
Out[59]:
wordle score count total relative
0 190 4 3 4 75.000000
1 190 5 1 4 25.000000
2 191 2 1 4 25.000000
3 191 4 1 4 25.000000
4 191 5 1 4 25.000000
5 191 6 1 4 25.000000
6 192 4 777 2367 32.826362
7 192 3 763 2367 32.234896
8 192 5 444 2367 18.757921
9 192 2 196 2367 8.280524
10 192 6 155 2367 6.548373
11 192 X 22 2367 0.929447
12 192 1 10 2367 0.422476
13 193 3 3587 8749 40.998971
14 193 4 2689 8749 30.734941
15 193 2 1057 8749 12.081381
16 193 5 1003 8749 11.464167
17 193 6 348 8749 3.977597
18 193 X 35 8749 0.400046
19 193 1 30 8749 0.342896
In [60]:
scores_count_total = scores_count.groupby('score')['count'].sum()
fig = px.bar(
    scores_count_total, # Total count of each score
    x='count',
    text_auto = '.3s',
    title='Total number of tweets of each Wordle score',
)
fig.update_traces(
    hovertemplate=None,
    marker_color=['#0a5e03','#50ba47','#b1f073','#f7f75c','#e8ac15','#cf4911','#851101'],
    marker_line=dict(width=1.2, color='black')
)
fig.update_layout(hovermode='y')
fig.show()
In [61]:
fig = px.pie(
    names=scores_count_total.index, values=scores_count_total, hole=0.5,
    title='Proportion of tweets with score'
)
fig.update_traces(
    marker_colors=['#0a5e03','#50ba47','#b1f073','#f7f75c','#e8ac15','#cf4911','#851101'],
    marker_line=dict(width=1.2, color='black')
)
fig.show()
In [62]:
fig = px.area(scores_count, x='wordle', y='count', color='score', 
        category_orders={'score':['1','2','3','4','5','6','X']},
        color_discrete_map={'1':'#0a5e03','2':'#50ba47','3':'#b1f073','4':'#f7f75c','5':'#e8ac15','6':'#cf4911','X':'#851101'},
        title='Number of tweets by score over time'
        )
fig.update_yaxes(title='Number of Tweets',fixedrange=False)
fig.update_xaxes(title='__Date__',rangeslider_visible=True,rangeslider_yaxis_rangemode='auto') #TODO date
fig.update_traces(hovertemplate=None)
fig.update_layout(hovermode='x unified')
fig.show()
In [63]:
fig = px.area(scores_count, x='wordle', y='relative', color='score', 
        category_orders={'score':['1','2','3','4','5','6','X']},
        color_discrete_map={'1':'#0a5e03','2':'#50ba47','3':'#b1f073','4':'#f7f75c','5':'#e8ac15','6':'#cf4911','X':'#851101'},
        range_y=[0,100],
        title='Proportion of tweets with score over time'
        )
fig.update_yaxes(title='Proportion of Tweets',fixedrange=False)
fig.update_xaxes(title='__Date__',rangeslider_visible=True,rangeslider_yaxis_rangemode='auto') #TODO date
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_layout(hovermode='x unified')
fig.show()
In [64]:
scores_count_sorted = scores_count.merge(twt_mean, on='wordle').sort_values(by=['mean','wordle'])
scores_count_sorted.head(20)
Out[64]:
wordle score count total relative mean answers
895 319 3 32253 97930 32.934749 3.101278 train
896 319 2 26394 97930 26.951904 3.101278 train
897 319 4 21382 97930 21.833963 3.101278 train
898 319 5 8770 97930 8.955376 3.101278 train
899 319 1 5822 97930 5.945063 3.101278 train
900 319 6 3001 97930 3.064434 3.101278 train
901 319 X 308 97930 0.314510 3.101278 train
811 307 3 43884 111993 39.184592 3.312547 plant
812 307 4 30758 111993 27.464217 3.312547 plant
813 307 2 21736 111993 19.408356 3.312547 plant
814 307 5 10508 111993 9.382729 3.312547 plant
815 307 6 2904 111993 2.593019 3.312547 plant
816 307 1 1918 111993 1.712607 3.312547 plant
817 307 X 285 111993 0.254480 3.312547 plant
720 294 3 45538 126430 36.018350 3.341545 stair
721 294 4 33418 126430 26.432018 3.341545 stair
722 294 2 26760 126430 21.165863 3.341545 stair
723 294 5 13236 126430 10.469034 3.341545 stair
724 294 6 4536 126430 3.587756 3.341545 stair
725 294 1 1948 126430 1.540774 3.341545 stair
In [65]:
fig = px.area(scores_count_sorted, x='answers', y='relative', color='score', 
        category_orders={'score':['1','2','3','4','5','6','X']},
        color_discrete_map={'1':'#0a5e03','2':'#50ba47','3':'#b1f073','4':'#f7f75c','5':'#e8ac15','6':'#cf4911','X':'#851101'},
        range_y=[0,100],
        title='Proportion of tweets with score sorted by mean score'
        )
fig.update_yaxes(title='Number of Tweets',fixedrange=False)
fig.update_xaxes(title='Word',rangeslider_visible=True,rangeslider_yaxis_rangemode='auto') #TODO date
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_layout(hovermode='x unified')
fig.show()
In [66]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=index,
    y=twt_mean['mean'],
    mode='lines+markers',
    name='Mean score',
    text=twt_mean['answers'],
    hovertemplate='<b>%{text}</b><br>Mean score: %{y}<br><extra></extra>'
))
fig.add_trace(go.Scatter(
    x=index,
    y=np.array(
        sm.nonparametric.lowess(twt_mean['mean'], twt_mean.index, frac=8/100)
    )[:,1],
    mode='lines',
    name='LOWESS fit',
    line=dict(width=3),hovertemplate=None
))
mean = df_twitter['score'].mean()
fig.update_layout(
    title_text='Average (mean) score of Wordle games',
    hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
    title='Mean score',
    fixedrange=False,
    range=[3,5]
)
fig.update_xaxes(
    title='Date',
    range=[index.iloc[0], index.iloc[-1]],
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto',
    rangeslider_range=[index.iloc[0], index.iloc[-1]]
)
fig.show()
In [67]:
px.box(x=twt_mean['mean'], title='Distribution of the average score of each day')
In [68]:
answer_trends = pd.read_csv('data/words/trends.csv', index_col=0)
answer_trends = answer_trends.apply(lambda col: pd.DataFrame(sm.nonparametric.lowess(col, col.index, frac=14/100)).set_index(0).squeeze(), axis=0)
answer_spikes = answer_trends.iloc[50:100].mean() / answer_trends.iloc[np.r_[:50,100:]].mean()
answer_spikes.name = 'spikes'

answer_spikes.sort_values().tail(20)
Out[68]:
slosh    2.905279
aphid    2.905921
ruder    2.909033
coyly    2.909570
gloat    2.917525
midge    2.918356
glean    2.922953
canny    2.925490
lowly    2.927586
piety    2.937736
crept    2.949311
droll    2.949753
wrung    2.953050
cynic    2.954943
gawky    2.966924
swill    2.975927
tacit    2.976672
foray    2.980783
parer    2.997150
trice    2.997755
Name: spikes, dtype: float64

'spikes' refers to the Google Trends search interest of a word around the time it is an answer. With more unknown answers, there are likely to be more people searching for the word online. However, this is a somewhat arbitrary measure.

In [69]:
mean_spikes = twt_mean.merge(answer_spikes, left_on='answers', right_index=True)
px.scatter(
    mean_spikes,
    y=mean_spikes['spikes'], x='mean',
    marginal_x='histogram',
    marginal_y='histogram'
)
In [70]:
stats.pearsonr(mean_spikes['mean'], mean_spikes['spikes'])
Out[70]:
(0.30508543269175975, 8.770829318927126e-07)
In [71]:
ans_freq = freq.merge(twt_mean)
ans_freq.head(20)
Out[71]:
n answers freq zipf mean
0 190 acute 1.000000e-05 4.00 4.250000
1 191 ferry 1.050000e-05 4.02 4.250000
2 192 aside 3.890000e-05 4.59 3.816205
3 193 tapir 1.950000e-07 2.29 3.530411
4 194 troll 5.370000e-06 3.73 3.752526
5 195 unify 1.170000e-06 3.07 4.327278
6 196 rebus 3.390000e-07 2.53 4.391588
7 197 boost 2.570000e-05 4.41 3.989553
8 198 truss 1.950000e-06 3.29 4.023979
9 199 siege 8.510000e-06 3.93 4.458242
10 200 tiger 2.000000e-05 4.30 3.921375
11 201 banal 7.590000e-07 2.88 4.335495
12 202 slump 2.290000e-06 3.36 4.078716
13 203 crank 4.070000e-06 3.61 4.167350
14 204 gorge 2.510000e-06 3.40 4.558212
15 205 query 4.370000e-06 3.64 4.373089
16 206 drink 7.940000e-05 4.90 3.684007
17 207 favor 3.800000e-05 4.58 4.440339
18 208 abbey 7.590000e-06 3.88 4.525946
19 209 tangy 4.370000e-07 2.64 4.188266
In [72]:
px.scatter(
    ans_freq,
    y='zipf', x='mean',
    marginal_x='histogram',
    marginal_y='histogram'
)
In [73]:
stats.pearsonr(ans_freq['mean'], ans_freq['zipf'])
Out[73]:
(-0.35770111812269373, 5.8445702885770625e-09)
In [74]:
answers['answers'].apply(set).str.len()
Out[74]:
n
179    5
180    5
181    5
182    5
183    5
      ..
454    4
455    5
456    5
457    5
458    5
Name: answers, Length: 280, dtype: int64
In [75]:
twt_mean.merge(
    answers['answers'].apply(set).str.len().rename('unique'), 
    how='inner', left_index=True, right_index=True)
Out[75]:
mean answers unique
190 4.250000 acute 5
191 4.250000 ferry 4
192 3.816205 aside 5
193 3.530411 tapir 5
194 3.752526 troll 4
... ... ... ...
435 4.686065 gauze 5
436 3.816523 chief 5
437 3.959463 onset 5
438 4.652926 prize 5
439 4.255168 fungi 5

250 rows × 3 columns

In [76]:
ans_unique = twt_mean.merge(answers['answers'].apply(set).str.len().rename('unique'), how='inner', left_index=True, right_index=True)
fig = px.scatter(
    ans_unique,
    y='unique', x='mean',custom_data=['answers'],
    marginal_x='histogram',
    marginal_y='histogram'
)
fig.update_traces(hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{x}<br>Unique letter: %{y}',selector=dict(type='scatter'))
In [77]:
stats.pearsonr(ans_unique['mean'], ans_unique['unique'])
Out[77]:
(-0.4254767213877685, 2.056215220036592e-12)
In [78]:
def match(guess, answer):
    result = [0] * 5
    guess_matched = [False] * 5
    answer_matched = [False] * 5
    
    for i in range(len(guess)):
        if guess[i] == answer[i] and not answer_matched[i]:
            result[i] = 2
            guess_matched[i] = True
            answer_matched[i] = True
    
    for i in range(len(guess)):
        guess_letter = guess[i]
        if not guess_matched[i]:
            for j in range(len(answer)):
                if not answer_matched[j] and guess_letter == answer[j]:
                    result[i] = 1
                    answer_matched[j] = True
                    break
    
    return result
In [79]:
def base3(n):
    digits = []
    while n:
        digits.append(n % 3)
        n //= 3
    return ([0,0,0,0,0] + (digits[::-1] or [0]))[-5:]
In [80]:
greens = pd.concat([df_twitter['score'], df_twitter['0'].map((pd.Series(range(0,243), index=range(1,244)).apply(lambda n: pd.Series(base3(n))) == 2).sum(axis=1)).astype('int8').rename('greens')], axis=1)
px.scatter(
    greens.value_counts().rename('count').reset_index(),
    x='score',y='greens',size='count',
    title='Comparision of the number of greens in the first row vs score, and the count of the combination'
)
In [81]:
stats.pearsonr(greens['score'], greens['greens'])
Out[81]:
(-0.2692805753434382, 0.0)
In [82]:
greys = pd.concat([df_twitter['score'], df_twitter['0'].map((pd.Series(range(0,243), index=range(1,244)).apply(lambda n: pd.Series(base3(n))) == 0).sum(axis=1)).astype('int8').rename('greys')], axis=1)
In [83]:
stats.pearsonr(greys['score'], greys['greys'])
Out[83]:
(0.26993847548792826, 0.0)

Starting words¶

In [84]:
wordcloud = WordCloud(width = 1600, height = 900,
                background_color ='white',
                min_font_size = 10).generate_from_frequencies(r_scores['word'].value_counts())

fig = px.imshow(wordcloud)
fig.update_layout(margin=dict(l=10,r=10,t=10,b=10),hovermode=False)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
In [85]:
word_scores = pd.merge(
    r_scores.loc[r_scores['score']>0].groupby('word')['score'].mean(),
    r_scores['word'].value_counts().rename('count'), 
    left_index=True, right_index=True
)
word_scores.head(20)
Out[85]:
score count
abase 5.000000 1
abate 3.000000 1
abhor 4.333333 3
abide 4.285714 7
abled 3.000000 1
abode 4.500000 2
abort 3.555556 9
about 3.952381 130
above 3.833333 6
abuse 3.666667 6
abysm 4.000000 1
acers 6.000000 1
ached 2.000000 1
aches 4.500000 2
acids 4.000000 1
acorn 3.500000 20
acres 3.600000 5
acrid 4.000000 7
actor 3.714286 7
acute 3.333333 15
In [86]:
fig = px.scatter(
    word_scores.sort_values(by='count').tail(50).reset_index(), 
    x='score', 
    y='count',
    text='index',
    title='Number of uses vs mean score of the 50 most popular starting words'
)
fig.update_traces(textposition='top center')
fig.show()
In [87]:
fig = px.scatter(
    word_scores[word_scores['count']>5].sort_values(by='score').head(50).reset_index(), 
    x='score', 
    y='count',
    text='index',
    title='Number of uses vs mean score of the 50 best starting words (by mean)'
)
fig.update_traces(textposition='top center')
fig.show()

Results Findings & Conclusion¶

Question 1: How has Wordle's popularity changed since it was first released to the public?¶

Google Trends¶

To analyse the popularity of things, Google Trends can be used. Google Trends data can be used to gauge the popularity of a keyword online by measuring the web search volume of the keyword. By acquiring the search volume of 'Wordle' over time, it can used as a useful estimate of how popular the game is.

In [88]:
df_pop = pd.DataFrame({
    'tweets':sm.nonparametric.lowess(twt_counts, twt_counts.index, frac=0.06)[:,1],
    'search':sm.nonparametric.lowess(trends['Wordle'], trends.index, frac=0.1)[:,1]
})
In [89]:
fig = px.line(
    x=index,
    y=df_pop['search'] / df_pop['search'].max() * 100,
    title='LOWESS fit of the relative search volume for Wordle over time'
)
fig.update_layout(
    hovermode='x unified'
)
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_yaxes(
    title='Relative volume (%)',
    fixedrange=False
)
fig.update_xaxes(
    title='Date',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto'
)
fig.show()

The Locally Weighted Scatterplot Smoothing (LOWESS) algorithm was applied to the raw data to reduce noise and more clearly identify trends.

The search volume for Wordle increased from December 2021, representing its surge in popularity after initially being released to the public in October. The search volume reached a peak in February 2022, with the highest search volume recorded on 17 February 2022. Since then, the search volume has been generally on the decline, decreasing to 40% of the maximum search volume in September 2022.

Twitter¶

In mid December 2021, Wordle introduced the sharing feature, which allowed users to share their game result wihtout spoiling the game for others. This is acheived by sharing a grid of coloured emoji squares, with each row representing a guess, and the colour of the emoji square representing whether the letter is absent, present, or correct. This could be easily shared on social media platforms like Twitter, where the cryptic emoji grids quickly went viral.

Again with the rhyming game. I lucked out this time!#Wordle420 4/6*

⬛⬛⬛🟨⬛
⬛⬛🟩⬛🟩
⬛🟩🟩🟩🟩
🟩🟩🟩🟩🟩 pic.twitter.com/cySnOsjYSg

— Gayla (@AZGayla) August 13, 2022

Wordle 420 X/6

⬛🟧⬛⬛⬛
⬛🟧⬛⬛⬛
⬛🟧⬛🟦🟧
🟧🟧⬛⬛🟧
🟧🟧⬛⬛🟧
🟧🟧⬛🟧🟧

this was so rude

— 🎃Hogueus Woahcus 🎃 (@HogueLikeWoah) August 14, 2022
In [90]:
fig = px.scatter(
    x=index,
    y=twt_counts,
    trendline="lowess", trendline_options=dict(frac=0.06),
    title='The number of Wordle scores shared on Twitter over time'
)
fig.update_layout(
    hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
    title='Number of tweets',
    fixedrange=False
)
fig.update_xaxes(
    title='Date',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(marker_size=5, hovertemplate=None)
fig.update_traces(hovertemplate='%{y} <b>(LOWESS trend)</b>', selector = dict(type='scatter', mode='lines'))
fig.show()

The number of Wordle scores shared on Twitter follows a similar trend, with an increase from December 2021 to a peak in February 2022, and a subsequent decrease. However, it has a slightly earlier peak, with a maximum of 321.7k tweets on 4 February 2022. Compared to the search volume, Tweet volume decreased more significantly, dropping to around 11% of the maximum by September 2022. This could suggest that rather than just being a measure of Wordle's popularity, the tweet volume is also a factor in popularity. As more people share their scores on Twitter, more people are exposed to the game, and would be more likely to search for the game online, or even play the game. The more significant decline in tweet volume as compared to search volume suggests that sharing Wordle scores online has become less popular, with Wordle itself not necessary having as significant of a decline. Wordle is no longer nearly as viral on Twitter as it once was during its peak, with such tweets less common as it becomes less of a popular thing to share.

For further insight, we can look at the number of (unique) users sharing their game everyday, and compare the proportion of new and repeat users.

In [91]:
fig = px.area(
    twt_users_all.divide(twt_users_all['total'], axis=0) * 100, 
    x=index, 
    y=['new', 'repeat'],
    labels={'variable':'Users'},
    title='Proportion of new and repeat users posting Wordle scores over time'
)
fig.update_yaxes(title='Proportion of users (%)',fixedrange=False)
fig.update_xaxes(
    title='Date',
    rangeslider_visible=True,
    rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_layout(hovermode='x unified')
fig.show()

We can see that the proportion of new users posting Wordle scores on Twitter has generally been declining since the game's release. The inital increase in users and tweets is explained by the relatively high proportion of new users near the start. As the proportion of new users declines, the growth slows, and eventually becomes negligible. After February 2022, the peak in Wordle's popularity, the proportion of new users drops to below 5% on most days, with only around 1% being new users after May 2022. This is an indication that there is much less interest or hype surrounding the game, with less people being interested to start playing the game. In addition, this shows that the decline in plays is primarily due to existing players losing interest and no longer playing, as the affect of new users is inconsequential. However, it could also just mean that sharing Wordle scores is no longer as popular, but this would also decrease people's exposure to the game as they see less Wordle scores.

Question 2: Where is wordle most popular?¶

Besides interest over time, Google Trends also has data on the search interest by region, which is useful for seeing where a search term is most popular. The search interest by region is a measure relative to the total number of Google searches in the region at a specific time.

In [92]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','Wordle'],
    data=regional,
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Search interest',
).add_to(m)
m
Out[92]:
Make this Notebook Trusted to load map: File -> Trust Notebook

For clarity, I have also displayed the top values in a table:

In [93]:
regional[['geoName','Wordle']].rename(
    columns={'geoName':'country','Wordle':'search interest'}
).sort_values(by='search interest',ascending=False).head(10)
Out[93]:
country search interest
92 Guernsey 100
155 New Zealand 97
106 Ireland 96
107 Isle of Man 81
112 Jersey 73
235 United Kingdom 69
38 Canada 61
85 Gibraltar 56
12 Australia 55
236 United States 53

This indicates that Wordle is most popular in the countries New Zealand, x and x.

We can also look at the tweets of Wordle games to compare by country. Twitter has a (optional) location field where users can specify their location. This can be used to classify users by country.

In [94]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','log'],
    data=country_users,
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Log10 of the number of users'
).add_to(m)
m
Out[94]:
Make this Notebook Trusted to load map: File -> Trust Notebook

By looking at the number of users, USA has by far the most people sharing Wordle games on Twitter. However, there are some limitations to this analysis. The location data is derived from the location field of the user, which is simply a text field where users can input anything. As such, the locations may not be exactly accurate or complete, and may not be representative of the actual distribution of countries (i.e. some countries may be more likely to specify location on Twitter).

We can then take the number of users as a proportion of the number of internet users in the country to get an estimate of the proportion of people in a country who post Wordle games on Twitter.

In [95]:
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
    geo_data='data/geo/countries.geojson',
    name="choropleth",
    columns=['alpha3','relative'],
    data=country_relative[
        # remove outliers
        country_relative['relative'] < country_relative['relative'].quantile(0.999)
    ],
    key_on="feature.properties.ISO_A3",
    fill_color="YlGnBu",
    legend_name='Proportion of internet users in country'
).add_to(m)
m
Out[95]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This yields a similar result to the Google Trends map, with countries like New Zealand and Ireland with higher proportions. Other countries like, USA, UK, and Canada also have relatively high proportions of people posting Wordle games on Twitter. This is not necessary a direct measure of how popular Wordle is in each country, as the popularity of Twitter varies among countries, and people in some countries may be more likely to post scores on Twitter. However, this still provides a decent estimate of Wordle's popularity by country.

Question 3: What is the average Wordle score?¶

4 is a Wordle par. 3 a birdie. 2 an eagle. 5 a bogey. 6 a double bogey. After 22 days I'm at -5.

— Tim Urban (@waitbutwhy) January 26, 2022
In [96]:
scores_count_total = scores_count.groupby('score')['count'].sum()
fig = px.bar(
    scores_count_total, # Total count of each score
    x='count',
    text_auto = '.3s',
    title='Total number of tweets of each Wordle score'
)
fig.update_traces(
    hovertemplate=None,
    marker_color=['#0a5e03','#50ba47','#b1f073','#f7f75c','#e8ac15','#cf4911','#851101'],
    marker_line=dict(width=1.2, color='black')
)
fig.update_layout(hovermode='y')
fig.show()

Having a score of 4 is by far the most common, with 8.38M tweets with a score of 4. This matches the general sentiment that a score of 4 is normal, or average, or unspectacular, with lower scores considered good, and higher scores considered poor. A score of 3 is the next most common with 6.20M tweets, and with scores of 5 close behind at just below 6M tweets. The majority of tweets fall within this score range, with the other scores often being regarded as just poor or great luck. Having a score of 1 is by far the rarest, with only 188k tweets. In general, 4 is most common score, with those above 4 being more common than those below, which makes intuitive sense as it would take more skill and luck to get lower scores.

In [97]:
df_twitter[df_twitter['score']>0]['score'].describe().to_frame()
Out[97]:
score
count 2.560955e+07
mean 4.081433e+00
std 1.137395e+00
min 1.000000e+00
25% 3.000000e+00
50% 4.000000e+00
75% 5.000000e+00
max 6.000000e+00

The median and mode score are both 4, and the mean score of all tweets is 4.08. Hence, we can conclude that the average wordle game would have a score of 4. The IQR of 3 to 5 also shows that the majority of games yield such a score.

In [98]:
fig = px.scatter(
    x=index,
    y=twt_mean['mean'],
    #mode='lines+markers',
    #name='Mean score',
    custom_data=[twt_mean['answers']],
    marginal_y='box',
    labels=dict(x="Date", y="Mean score")
)
fig.update_traces(
    hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{y}<br><extra></extra>',
    selector=dict(type='scatter')
)
fig.update_layout(
    title_text='Average (mean) score of Wordle games over time',
    #hovermode='x unified',
    #selector=dict(type='scatter')
)
fig.show()

As shown, the average scores of each wordle game generally lies around 4. More specifically, the median of the average score of each Wordle is 4.12.

Question 4: What makes some Wordles more difficult than others?¶

In [99]:
fig = px.scatter(
    ans_freq,
    y='zipf', x='mean', custom_data=['answers'],
    marginal_x='histogram',
    marginal_y='histogram',
    title='zipf frequency vs average score of word'
)
fig.update_traces(hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{x}<br>zipf: %{y}',selector=dict(type='scatter'))
In [100]:
stats.pearsonr(ans_freq['mean'], ans_freq['zipf'])
Out[100]:
(-0.35770111812269373, 5.8445702885770625e-09)
In [101]:
ans_unique = twt_mean.merge(answers['answers'].apply(set).str.len().rename('unique'), how='inner', left_index=True, right_index=True)
fig = px.scatter(
    ans_unique,
    y='unique', x='mean',custom_data=['answers'],
    marginal_x='histogram',
    marginal_y='histogram',
    title='Number of unique letters vs average score of word',
    trendline='ols'
)
fig.update_traces(hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{x}<br>Unique letter: %{y}',selector=dict(type='scatter'))
In [102]:
stats.pearsonr(ans_unique['mean'], ans_unique['unique'])
Out[102]:
(-0.4254767213877685, 2.056215220036592e-12)

Both the word frequency and number of unique letters have an effect on the average score of a Wordle. Higher word frequency implies that the word is more likely to be known, which decreases the average number of guesses needed. Have repeated letters may be less expected to players and harder to guess, which can result in a higher number of guesses. However, while there is some correlation between the factors and the average difficultly, the correlation is not strong.

In [103]:
ans_factors = ans_freq.merge(ans_unique)
mlm = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(ans_factors[['zipf','unique']], ans_factors['mean'], test_size=0.3, random_state=0)
mlm.fit(x_train, y_train)

y_hat = mlm.predict(x_test)
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(y_hat,  color='b', label='Fitted Value')
plt.legend()
plt.show()

Evidently, it is not very accurate to use these two factors alone to predict score. While the mode value is similar, the density is quite far off.

As a result of human nature, this result is not unexpected. Logically there is some correlation, but due to human nature, the correlation is not strong. As different people attempt the same wordle, they have different skill levels, different approaches to solving, etc. As such, the average score is not necessarily a good measure of difficultly. Taking this into consideration, I believe it is fair to consider the (mild) correlation between the factors and the average score as a decent measure for difficultly.

Question 5: Which is the most common starting word and how much does it affect the score?¶

To get an idea of the most common starting word, we will look at the starting words used in the Reddit comments.

In [104]:
wordcloud = WordCloud(width = 1600, height = 900,
                background_color ='white',
                min_font_size = 10).generate_from_frequencies(r_scores['word'].value_counts())

fig = px.imshow(wordcloud, title='Word cloud of the most popular starting words on Reddit')
fig.update_layout(margin=dict(l=10,r=10,t=40,b=10),hovermode=False)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()

It can be seen that the most used starting word on Reddit is 'stare', followed by 'crane'. Many of these words contain multiple vowels, with a and e being especially common.

To more clearly show this, I have displayed the top 20 most popular starting words below with their count.

In [105]:
r_scores['word'].value_counts().rename('count').to_frame().head(20)
Out[105]:
count
stare 3085
crane 1616
raise 1387
adieu 1143
crate 966
audio 897
slate 834
trace 731
arise 709
salet 632
roate 576
tares 506
house 490
stern 472
train 430
tears 425
crwth 401
soare 395
irate 392
stear 387

Recommendations or Further Works¶

With more time and further analysis, I'm sure a more in-depth look into Wordle can be acheived. For instance, more variables such as the amount of interactions on social media posts can be considered to provide more insight on trends and patterns. More detailed analysis, modelling and simulations can also be used to consider the difficultly of words, and the effects of different starting words and stategies. However, without data from Wordle itself, there is a limit to how accurate such analysis can be.